Optical Flow in Mostly Rigid Scenes
The optical flow of natural scenes is a combination of the motion of the
observer and the independent motion of objects. Existing algorithms typically
focus either on recovering motion and structure under the assumption of a
purely static world, or on estimating optical flow for general unconstrained scenes. We
combine these approaches in an optical flow algorithm that estimates an
explicit segmentation of moving objects from appearance and physical
constraints. In static regions we take advantage of strong constraints to
jointly estimate the camera motion and the 3D structure of the scene over
multiple frames. This allows us to also regularize the structure instead of the
motion. Our formulation uses a Plane+Parallax framework, which works even under
small baselines, and reduces the motion estimation to a one-dimensional search
problem, resulting in more accurate estimation. In moving regions the flow is
treated as unconstrained, and computed with an existing optical flow method.
The resulting Mostly-Rigid Flow (MR-Flow) method achieves state-of-the-art
results on both the MPI-Sintel and KITTI-2015 benchmarks.
Comment: 15 pages, 10 figures; accepted for publication at CVPR 2017
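To make the Plane+Parallax idea concrete, here is a minimal Python sketch (not the MR-Flow implementation) of how homography-compensated residual motion reduces to a one-dimensional search: after warping the next frame with the homography of the dominant plane, the residual flow at each pixel points along the line through that pixel and the epipole, so only a scalar parallax magnitude needs to be searched. The warped frame and the epipole are assumed given; the single-pixel squared difference stands in for a robust, patch-based, multi-frame cost.

```python
import numpy as np

def parallax_1d_search(ref, warped, epipole, pixel, max_disp=20):
    """1D residual-motion search along the epipolar direction.

    ref     : HxW grayscale reference frame
    warped  : HxW next frame, pre-warped by the homography of the
              dominant plane (assumed available)
    epipole : (x, y) epipole in the reference frame (assumed known
              from the estimated camera motion)
    pixel   : (x, y) integer pixel to match
    """
    x, y = pixel
    d = np.array([x - epipole[0], y - epipole[1]], dtype=float)
    d /= np.linalg.norm(d) + 1e-8        # unit direction through the epipole
    best_alpha, best_cost = 0.0, np.inf
    for alpha in np.linspace(-max_disp, max_disp, 2 * max_disp + 1):
        xs = int(round(x + alpha * d[0]))
        ys = int(round(y + alpha * d[1]))
        if 0 <= ys < warped.shape[0] and 0 <= xs < warped.shape[1]:
            cost = (float(ref[y, x]) - float(warped[ys, xs])) ** 2
            if cost < best_cost:
                best_cost, best_alpha = cost, alpha
    return best_alpha * d                # residual (parallax) flow vector
```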
Long Range Motion Estimation and Applications
Finding correspondences between images underlies many computer vision problems, such as optical flow, tracking, stereo vision and alignment. Finding these correspondences involves formulating a matching function and optimizing it. This optimization is often done by gradient descent, which avoids exhaustive search but relies on the assumption of being in the basin of attraction of the right local minimum. This is often the case when the displacement is small, and current methods obtain very accurate results for small motions.
However, when the motion is large and the matching function is abrupt, this assumption is less likely to hold. One traditional way of avoiding this abruptness is to smooth the matching function spatially by blurring the images. As the displacement becomes larger, the amount of blur required to smooth the matching function also becomes larger. This averaging of pixels leads to a loss of detail in the image. There is therefore a trade-off between the size of the objects that can be tracked and the displacement that can be captured.
In this thesis we address the basic problem of increasing the size of the basin of attraction of a matching function. We use an image descriptor called distribution fields (DFs). By blurring the images in DF space instead of in pixel space, we increase the size of the basin of attraction with respect to traditional methods. We show competitive results using DFs in both object tracking and optical flow. Finally, we demonstrate an application of capturing large motions for temporal video stitching.
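A minimal sketch of the distribution-field construction described above, assuming grayscale input and SciPy's Gaussian filter; the bin count and blur widths are illustrative, not the thesis settings:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def distribution_field(img, n_bins=8, sigma_space=5.0, sigma_feature=1.0):
    """Explode a grayscale image into one layer per intensity bin,
    then blur spatially and across bins.

    Blurring the layers spreads probability mass without averaging
    pixel values together, so the matching cost gets a wider basin
    of attraction while image detail is preserved.
    """
    img = np.asarray(img, dtype=float)
    bins = np.minimum((img / 256.0 * n_bins).astype(int), n_bins - 1)
    df = np.stack([(bins == b).astype(float) for b in range(n_bins)])
    return gaussian_filter(df, sigma=(sigma_feature, sigma_space, sigma_space))
```

Matching two images then compares their distribution fields (e.g. with an L1 distance over the layers) rather than raw pixel values.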
SMART Frame Selection for Action Recognition
Action recognition is computationally expensive. In this paper, we address
the problem of frame selection to improve the accuracy of action recognition.
In particular, we show that selecting good frames improves action recognition
performance even in the trimmed-video domain. Recent work has successfully
leveraged frame selection for long, untrimmed videos, where much of the content
is not relevant, and easy to discard. In this work, however, we focus on the
more standard short, trimmed action recognition problem. We argue that good
frame selection can not only reduce the computational cost of action
recognition but also increase the accuracy by getting rid of frames that are
hard to classify. In contrast to previous work, we propose a method that,
instead of considering frames one at a time, selects them jointly. This
results in a more efficient selection, where good frames are more
effectively distributed over the video, like snapshots that tell a story. We
call the proposed frame selection SMART and we test it in combination with
different backbone architectures and on multiple benchmarks (Kinetics,
Something-something, UCF101). We show that the SMART frame selection
consistently improves the accuracy compared to other frame selection strategies
while reducing the computational cost by a factor of 4 to 10.
Additionally, we show that when the primary goal is recognition performance,
our selection strategy can improve over recent state-of-the-art models and
frame selection strategies on various benchmarks (UCF101, HMDB51, FCVID, and
ActivityNet).
Comment: To be published in AAAI-21
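As a toy illustration of the difference between scoring frames one at a time and selecting them jointly, the sketch below (not the SMART model, which learns its scores with attention over the whole video) greedily picks frames that are individually informative while penalizing redundancy with frames already chosen, so the selected set spreads over the video like the "snapshots that tell a story" described above:

```python
import numpy as np

def select_frames_jointly(scores, features, k):
    """Greedy joint selection of k frames.

    scores   : (T,) per-frame relevance scores (assumed given)
    features : (T, D) per-frame feature vectors
    Picks high-scoring frames while discounting frames similar to
    ones already selected, so the set covers the video.
    """
    feats = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    chosen = []
    for _ in range(min(k, len(scores))):
        best_t, best_val = None, -np.inf
        for t in range(len(scores)):
            if t in chosen:
                continue
            overlap = max((float(feats[t] @ feats[c]) for c in chosen),
                          default=0.0)
            val = scores[t] - overlap    # relevance minus redundancy
            if val > best_val:
                best_t, best_val = t, val
        chosen.append(best_t)
    return sorted(chosen)
```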
Learning Action Changes by Measuring Verb-Adverb Textual Relationships
The goal of this work is to understand the way actions are performed in
videos. That is, given a video, we aim to predict an adverb indicating a
modification applied to the action (e.g. cut "finely"). We cast this problem as
a regression task. We measure textual relationships between verbs and adverbs
to generate a regression target representing the action change we aim to learn.
We test our approach on a range of datasets and achieve state-of-the-art
results on both adverb prediction and antonym classification. Furthermore, we
outperform previous work when we lift two commonly assumed conditions: the
availability of action labels during testing and the pairing of adverbs as
antonyms. Existing datasets for adverb recognition are either noisy, which
makes learning difficult, or contain actions whose appearance is not influenced
by adverbs, which makes evaluation less reliable. To address this, we collect a
new high-quality dataset: Adverbs in Recipes (AIR). We focus on instructional
recipe videos, curating a set of actions that exhibit meaningful visual
changes when performed differently. Videos in AIR are more tightly trimmed and
were manually reviewed by multiple annotators to ensure high labelling quality.
Results show that models learn better from AIR given its cleaner videos. At the
same time, adverb prediction on AIR is challenging, demonstrating that there is
considerable room for improvement.
Comment: CVPR 23. Code and dataset available at
https://github.com/dmoltisanti/air-cvpr2
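One plausible way to read "measuring verb-adverb textual relationships" as a regression target is sketched below, under stated assumptions: `embed` is a hypothetical stand-in for any text-embedding function returning a vector (the paper's actual text model and formula may differ), and the target is simply the size of the shift the adverb induces in the verb's embedding:

```python
import numpy as np

def action_change_target(verb, adverb, embed):
    """Scalar target for how strongly `adverb` modifies `verb`.

    embed : hypothetical text-embedding function, str -> np.ndarray.
    Returns the magnitude of the displacement between the embeddings
    of the bare verb and the adverb-modified phrase.
    """
    v = embed(verb)                   # e.g. "cut"
    va = embed(f"{verb} {adverb}")    # e.g. "cut finely"
    return float(np.linalg.norm(va - v))
```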
A Closer Look at Temporal Ordering in the Segmentation of Instructional Videos
Understanding the steps required to perform a task is an important skill for
AI systems. Learning these steps from instructional videos involves two
subproblems: (i) identifying the temporal boundary of sequentially occurring
segments and (ii) summarizing these steps in natural language. We refer to this
task as Procedure Segmentation and Summarization (PSS). In this paper, we take
a closer look at PSS and propose three fundamental improvements over current
methods. The segmentation task is critical, as generating a correct summary
requires each step of the procedure to be correctly identified. However,
current segmentation metrics often overestimate the segmentation quality
because they do not consider the temporal order of segments. In our first
contribution, we propose a new segmentation metric that takes into account the
order of segments, giving a more reliable measure of the accuracy of a given
predicted segmentation. Current PSS methods are typically trained by proposing
segments, matching them with the ground truth and computing a loss. However,
much like segmentation metrics, existing matching algorithms do not consider
the temporal order of the mapping between candidate segments and the ground
truth. In our second contribution, we propose a matching algorithm that
constrains the temporal order of segment mapping, and is also differentiable.
Lastly, we introduce multi-modal feature training for PSS, which further
improves segmentation. We evaluate our approach on two instructional video
datasets (YouCook2 and Tasty) and observe an improvement over the
state-of-the-art for both procedure segmentation and summarization.
Comment: Accepted at BMVC 2022
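To illustrate what an order-constrained matching means, here is a small dynamic-programming sketch: predictions may be matched to ground-truth segments only in temporally monotonic order, unlike an unconstrained (e.g. Hungarian) assignment. This hard-min version is for intuition only; the matcher proposed in the paper is additionally differentiable (a soft-min over the same recurrence would be one way to get there):

```python
import numpy as np

def ordered_matching_cost(cost):
    """Minimum cost of an order-preserving assignment.

    cost : (P, G) array of pairwise costs between P predicted segments
           and G ground-truth segments, both in temporal order.
    Every ground-truth segment must be matched; predictions may be
    skipped; matches may not cross in time.
    """
    P, G = cost.shape
    dp = np.full((P + 1, G + 1), np.inf)
    dp[:, 0] = 0.0                                   # nothing matched yet
    for i in range(1, P + 1):
        for j in range(1, G + 1):
            dp[i, j] = min(dp[i - 1, j],             # skip prediction i
                           dp[i - 1, j - 1] + cost[i - 1, j - 1])
    return dp[P, G]                                  # inf if P < G
```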
CLASTER: Clustering with Reinforcement Learning for Zero-Shot Action Recognition
Zero-shot action recognition is the task of recognizing action classes
without visual examples, only with a semantic embedding which relates unseen to
seen classes. The problem can be seen as learning a function which generalizes
well to instances of unseen classes without losing discrimination between
classes. Neural networks can model the complex boundaries between visual
classes, which explains their success as supervised models. However, in
zero-shot learning, these highly specialized class boundaries may not transfer
well from seen to unseen classes. In this paper, we propose a clustering-based
model, which considers all training samples at once, instead of optimizing for
each instance individually. We optimize the clustering using Reinforcement
Learning which we show is critical for our approach to work. We call the
proposed method CLASTER and observe that it consistently improves over the
state-of-the-art on all standard datasets (UCF101, HMDB51, and Olympic Sports),
in both the standard zero-shot evaluation and the generalized zero-shot
learning setting.
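A toy sketch of the two ingredients named above, under loudly hypothetical choices (CLASTER's actual architecture, reward, and update rule are not reproduced here): samples are represented by their soft assignment to cluster centroids, so all training data shapes the representation through the clusters, and the centroids are nudged with a one-sample REINFORCE/evolution-strategies step driven by a downstream reward such as validation accuracy:

```python
import numpy as np

rng = np.random.default_rng(0)

def cluster_representation(x, centroids):
    """Soft assignment of one sample to K centroids: the representation
    downstream classifiers see is shaped by all training data via the
    learned clusters, not by each instance in isolation."""
    d = -np.linalg.norm(centroids - x, axis=1)   # negative distances
    e = np.exp(d - d.max())
    return e / e.sum()

def rl_centroid_step(centroids, reward_fn, lr=0.1, sigma=0.05):
    """One-sample REINFORCE/ES-style update of the centroids.

    reward_fn : hypothetical callable scoring a centroid set, e.g.
                zero-shot accuracy on a held-out validation batch.
    """
    noise = rng.normal(0.0, sigma, size=centroids.shape)
    reward = reward_fn(centroids + noise)        # score perturbed clusters
    return centroids + lr * reward * noise / sigma
```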